Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger
Authors
Abstract
Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and many types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system’s performance is genre and domain dependent, and the entity categories used also vary considerably (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a three-part categorization into locations, persons and corporations. In this paper we report evaluation results of NER on two different data sets: the digitized Finnish historical newspaper collection Digi and modern Finnish technology news from Digitoday. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75,931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. The new tool we evaluate for NER tagging is unconventional: it is a rule-based Finnish Semantic Tagger, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER.
Similar resources
Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910
Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In ...
A semantic tagger for the Finnish language
This paper reports on the current status and evaluation of a Finnish semantic tagger (hereafter FST), which was developed in the EU-funded Benedict Project. In this project, we have ported the Lancaster English semantic tagger (USAS) to the Finnish language. We have re-used the existing software architecture of USAS, and applied the same semantic field taxonomy developed for English to Finnish....
Porting an English semantic tagger to the Finnish language
Semantic annotation is an important and challenging issue in corpus linguistics and language engineering. While such a tool is available for English in Lancaster (Wilson and Rayson 1993), few such tools have been reported for other languages. In a joint Benedict project funded by the European Community under the ‘Information Society Technologies Programme’, we have been working towards developi...
Old Content and Modern Tools - Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910
Kimmo Kettunen, Eetu Mäkelä, Teemu Ruokolainen, Juha Kuokkala and Laura Löfberg. 1 National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland; 2 Aalto University, Semantic Computing Research Group, Espoo, Finland; 3 National Library of Finland, Centre for Preservation and Digitization, Mikkeli, Finland teemu.ruokolainen@h...
Improving Finite-State Spell-Checker Suggestions with Part of Speech N-Grams
We demonstrate a finite-state implementation of context-aware spell checking utilizing an N-gram based part of speech (POS) tagger to rerank the suggestions from a simple edit-distance based spell-checker. We demonstrate the benefits of context-aware spellchecking for English and Finnish and introduce modifications that are necessary to make traditional N-gram models work for morphologically mo...